AITopics

Technology: Information Technology > Artificial Intelligence (0.88)

Neural Information Processing SystemsFeb-16-2026, 22:47:18 GMT

A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning

Vision Transformer (ViT) has shown great power in learning from large-scale datasets. However, collecting sufficient data for expert knowledge is always difficult. To handle this problem, Cross-Domain Few-Shot Learning (CDFSL) has been proposed to transfer the source-domain knowledge learned from sufficient data to target domains where only scarce data is available.

artificial intelligence, machine learning, natural language, (16 more...)

Country:

North America > United States (0.14)
Asia > China > Hubei Province (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceNov-25-2025

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

Tong, Jintao, Jin, Wenwei, Qin, Pengda, Li, Anqi, Zou, Yixiong, Li, Yuhong, Li, Yuhua, Li, Ruixuan

Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut

flowcut, large language model, machine learning, (18 more...)

2505.19536

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Huang, Lianghuan, Chang, Yingshan

Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs

arXiv.org Artificial IntelligenceOct-14-2025

Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability--the recoverability of information from hidden states--and causality--the extent to which those states functionally influence outputs. In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations across clean-corrupted image pairs. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in mid-layers but only acquires causal power in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation--what information is present versus what is used--and that their divergence can expose hidden computational circuits.

artificial intelligence, machine learning, prediction, (15 more...)

2510.09794

Country: North America > United States > Pennsylvania (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.35)

Neural Information Processing SystemsOct-10-2025, 11:07:40 GMT

9b77f07301b1ef1fe810aae96c12cb7b-Paper-Conference.pdf

cl token, domain information, information, (14 more...)

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China > Hubei Province (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision (0.95)

Demilt, Ryan P., LaHaye, Nicholas, Tenneson, Karis

The View From Space: Navigating Instrumentation Differences with EOFMs

arXiv.org Artificial IntelligenceOct-7-2025

Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

artificial intelligence, machine learning, modality, (17 more...)

2510.03316

Country: North America > United States (0.69)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)

Hufe, Lorenz, Venhoff, Constantin, Dreyer, Maximilian, Lapuschkin, Sebastian, Samek, Wojciech

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

arXiv.org Artificial IntelligenceAug-29-2025

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.

artificial intelligence, machine learning, natural language, (16 more...)

2508.2057

Country: Europe > Switzerland (0.28)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.49)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsAug-17-2025, 07:36:12 GMT

Supplemental Material

We found channels using Y ouTube's auto-generated'topic' pages, corresponding to entries in

data mining, machine learning, natural language, (21 more...)

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom (0.04)

Genre: Research Report (0.46)

Industry:

Media (1.00)
Leisure & Entertainment > Sports > Tennis (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Data Science > Data Mining (0.67)

Neural Information Processing SystemsAug-15-2025, 09:14:25 GMT

7c220a2091c26a7f5e9f1cfb099511e3-Supplemental.pdf

grid block, pixelshuffle, transgan, (13 more...)

Country: Oceania > Australia > New South Wales > Sydney (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.47)

Neural Information Processing SystemsMay-27-2025, 10:23:09 GMT

A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning

Vision Transformer (ViT) has shown great power in learning from large-scale datasets. However, collecting sufficient data for expert knowledge is always difficult. To handle this problem, Cross-Domain Few-Shot Learning (CDFSL) has been proposed to transfer the source-domain knowledge learned from sufficient data to target domains where only scarce data is available. In this paper, we find an intriguing phenomenon neglected by previous works for the CDFSL task based on ViT: leaving the CLS token to random initialization, instead of loading source-domain trained parameters, could consistently improve target-domain performance. We find the CLS token naturally absorbs domain information due to the inherent structure of the ViT, which is represented as the low-frequency component in the Fourier frequency space of images. Based on this phenomenon and interpretation, we further propose a method for the CDFSL task to decouple the domain information in the CLS token during the source-domain training, and adapt the CLS token on the target domain for efficient few-shot learning.

cl token, cross-domain few-shot learning, domain information, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)